This R Markdown document summarizes the geographic distribution of host tree genera using occurrence data from AutoArborist, OpenTrees, iNaturalist, and the USFS Forest Inventory and Analysis (FIA) dataset.
The distribution and relative abundance of host tree genera across the US could serve as priors to improve image classification of tree genera.
Analysis Questions
What are the observed geographic distributions of tree genera in the United States?
How do different data sources (AutoArborist, OpenTrees, iNaturalist, FIA) vary in their characterization of tree genera distributions? Do urban tree genera have a different relative abundance than those in natural settings?
How can we characterize geographic distribution shifts in tree genera to potentially improve image classification?
The AutoArborist dataset paper (Beery et al., 2022) describes how distribution shifts in tree genera affect the accuracy of tree genus classification using CNNs.
“One of our primary challenges is to be able to do well on novel cities that were not part of the training set, but in order for a model to do so, it will have to contend with distribution shift, where the training distribution of cities differs from the novel test distribution on some new city.”
Label shift occurs when the marginal distribution of labels (genera) differs from city to city even if the appearance of each genus in images does not change. Put simply, genus distributions vary geographically (e.g., palm trees are common in Southern California and rare in Canada).
Figure 4 visualizes the distribution shift between pairs of cities using the L1 norm distance between normalized genus distributions.
There is a strong correlation between distribution similarity and performance. Notably, models can achieve the same accuracy with significantly less training data if the training and test distributions are similar.
Put simply, tree genus classification accuracy declines when the distribution of genera differs between training and testing.
The authors use the L1 norm to describe the distribution shifts between cities with different tree genera.
The L1 norm, also known as the Manhattan distance, between two vectors \(\mathbf{u}\) and \(\mathbf{v}\) is calculated as:
\[ ||\mathbf{u} - \mathbf{v}||_1 = \sum_{i=1}^{n} |u_i - v_i| \]
where \(n\) is the dimensionality of the vectors.
## [1] "Example: The distribution of tree genera in city 1 (vector1) and city 2 (vector2) can be compared using the L1 norm."
## [1] "Count of tree genera in city 1:"
## [1] 1 2 3 4 5 4 3 2 1 2 1
## [1] "Count of tree genera in city 2:"
## [1] 8 7 7 2 3 1 2 0 2 1 2
## [1] "L1 distance between vector1 and vector2: 29"
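The example output above can be reproduced with a few lines of base R. This is a minimal sketch using the two count vectors printed above; the function name `l1_distance` is illustrative and not taken from the original analysis.

```r
# Counts of tree genera in two cities (values copied from the example output)
vector1 <- c(1, 2, 3, 4, 5, 4, 3, 2, 1, 2, 1)
vector2 <- c(8, 7, 7, 2, 3, 1, 2, 0, 2, 1, 2)

# L1 (Manhattan) distance: sum of absolute element-wise differences
l1_distance <- function(u, v) {
  stopifnot(length(u) == length(v))
  sum(abs(u - v))
}

l1_raw <- l1_distance(vector1, vector2)
print(paste("L1 distance between vector1 and vector2:", l1_raw))

# Converting counts to proportions first removes the effect of sample size,
# so cities with very different numbers of records remain comparable
l1_norm <- l1_distance(vector1 / sum(vector1), vector2 / sum(vector2))
```

On raw counts the distance depends on how many trees each city recorded, which is why the comparisons later in this document normalize counts before computing the distance.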
## [1] "There are 23 cities sampled in AutoArborist."
## [1] "There are 309 tree genera sampled from AutoArborist."
## [1] "There are 1155612 tree genera records sampled from AutoArborist."
## [1] "Example City: Sioux Falls, SD"
##
## acer fraxinus quercus ulmus prunus tilia
## 82255 64204 56558 55309 53582 46406
## pyrus gleditsia malus platanus liquidambar pinus
## 34918 33466 33443 30024 26319 24467
## magnolia picea ginkgo zelkova celtis crataegus
## 23027 21893 20614 18058 17974 17517
## populus carpinus
## 16468 15952
## [1] "Abundance of tree genera in cities of North America from AutoArborist"
## [1] "We compare tree genera between cities by first normalizing each city's genus counts to proportions. Genera not recorded in a city are given a count of zero."
## [1] "The first ten cities and ten genera normalized per city."
##           City                abies             abutilon              acacia
##  1 Bloomington 0.000218866272707376                    0                   0
##  2     Boulder  0.00274622394207964                    0                   0
##  3     Buffalo 9.86436498150432e-05                    0                   0
##  4     Calgary 0.000384583688157569                    0                   0
##  5   Cambridge  0.00344717011383678                    0                   0
##  6    Columbus 4.14353194663131e-05                    0                   0
##  7      Denver  0.00106410911955335                    0                   0
##  8    Edmonton 5.96445186687343e-05                    0                   0
##  9   Kitchener  0.00026214037617144                    0                   0
## 10  LosAngeles 2.03194213028813e-05 6.77314043429376e-06 0.00992265073624037
##    acca                acer acrocomia             aesculus afrocarpus
##  1    0   0.226745458524841         0 0.000656598818122127          0
##  2    0   0.138184995631007         0  0.00291266175675113          0
##  3    0   0.242466091245376         0   0.0301849568434032          0
##  4    0   0.027690025547345         0  0.00255473449990385          0
##  5    0    0.36307519640853         0   0.0041686708353375          0
##  6    0   0.099306648987597         0  0.00276235463108754          0
##  7    0  0.0546703334669228         0   0.0110971379610564          0
##  8    0  0.0498926398663963         0  0.00745556483359179          0
##  9    0   0.268169604823383         0   6.553509404286e-05          0
## 10    0 0.00701020034949405         0 4.06388426057626e-05          0
##                 agathis
##  1                    0
##  2                    0
##  3                    0
##  4                    0
##  5                    0
##  6                    0
##  7                    0
##  8                    0
##  9                    0
## 10 1.35462808685875e-05
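A table of per-city proportions like the one above can be built from raw occurrence records with base R. The sketch below uses a toy data frame; the column names `city` and `genus` and all values are hypothetical, and the original analysis may have used different tooling.

```r
# Toy occurrence records; `city` and `genus` are hypothetical column names
records <- data.frame(
  city  = c("A", "A", "A", "A", "B", "B"),
  genus = c("acer", "acer", "quercus", "ulmus", "acer", "pinus")
)

# Cross-tabulate: one row per city, one column per genus.
# Genera never recorded in a city automatically get a count of zero.
counts <- table(records$city, records$genus)

# Divide each row by its total so that every city sums to 1
proportions <- prop.table(counts, margin = 1)
print(round(proportions, 3))
```

Because every row sums to 1, cities with very different numbers of records can be compared directly.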
## [1] "We can calculate and compare the L1 norm distance of tree genera counts between cities."
## [1] "L1 distances of tree genera between cities shown using a dendrogram."
## [1] "Cities with similar distributions of tree genera cluster together, apart from cities with dissimilar distributions."
## [1] "The L1 norm calculated between cities shows patterns of similar and dissimilar tree genera."
## [1] "The L1 norm sums the absolute differences between two vectors."
## [1] "The L2 norm (Euclidean distance) squares the differences before summing, so large differences in individual genera weigh more heavily."
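The pairwise distances and dendrogram described above can be sketched with `dist()` and `hclust()` from base R. The proportions below are made up for illustration, and the average linkage method is an assumption, since the source does not state which linkage was used.

```r
# Toy matrix of normalized genus proportions (rows = cities); values are made up
props <- rbind(
  city_a = c(acer = 0.50, quercus = 0.30, pinus = 0.20),
  city_b = c(acer = 0.45, quercus = 0.35, pinus = 0.20),
  city_c = c(acer = 0.05, quercus = 0.15, pinus = 0.80)
)

# Pairwise distances between rows
d_l1 <- dist(props, method = "manhattan")  # sum of absolute differences
d_l2 <- dist(props, method = "euclidean")  # square root of summed squared differences

# Hierarchical clustering on the L1 distances, drawn as a dendrogram
# (average linkage is an assumed choice)
hc <- hclust(d_l1, method = "average")
plot(hc, main = "Cities clustered by L1 distance", xlab = "", sub = "")
```

In this toy example `city_a` and `city_b` merge first because their genus proportions are nearly identical, while `city_c` joins last.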
## [1] "There are 293 tree genera sampled from OpenTrees."
## [1] "There are 5987025 tree genera records sampled from OpenTrees."
## [1] "There are 70 cities sampled by OpenTrees."
## [1] "The total count and distribution of tree genera sampled from OpenTrees across 70 cities."
##
## acer fraxinus ulmus quercus picea prunus
## 1031057 592574 408195 379096 337959 292677
## tilia gleditsia populus platanus pinus malus
## 281159 241142 184923 182368 168242 158180
## pyrus syringa ginkgo liquidambar celtis betula
## 155733 85833 70740 62395 61908 58684
## zelkova magnolia
## 55300 52945
## [1] "Abundance of tree genera in cities of North America from OpenTrees"
## [1] "We compare tree genera between cities by first normalizing each city's genus counts to proportions. Genera not recorded in a city are given a count of zero."
## [1] "The first ten cities and ten genera normalized per city."
##          source                abies abutilon              acacia acca
##  1    auburn_me   0.0194697597348799        0                   0    0
##  2     berkeley 0.000438994410137844        0 0.00798969826450877    0
##  3      boulder  0.00517255152580284        0                   0    0
##  4   bozeman_mt   0.0010688591983556        0                   0    0
##  5   buffalo-ny 0.000891099643560143        0                   0    0
##  6      calgary 0.000131245078309563        0                   0    0
##  7    cambridge  0.00156212856997317        0                   0    0
##  8 champaign_il 0.000420698359276399        0                   0    0
##  9      cornell  0.00696290669705026        0                   0    0
## 10       denver   0.0015654572190444        0                   0    0
##                  acer acrocomia            aesculus afrocarpus agathis
##  1   0.43827671913836         0 0.00248550124275062          0       0
##  2 0.0851356492727326         0  0.0126137727179607          0       0
##  3  0.116012941364435         0  0.0040741332481227          0       0
##  4  0.170030832476876         0  0.0055087358684481          0       0
##  5  0.346065861573655         0  0.0237005905197638          0       0
##  6 0.0130640934298297         0 0.00170826927323559          0       0
##  7  0.194688762862091         0  0.0030902978232078          0       0
##  8  0.331136353012668         0 0.00331884261206937          0       0
##  9  0.120394986707178         0  0.0035447525003165          0       0
## 10  0.145126678908894         0 0.00609297056940428          0       0
## [1] "L1 distances of tree genera between cities shown using a dendrogram."
## [1] "Cities with similar distributions of tree genera cluster together, apart from cities with dissimilar distributions."
## [1] "The L1 norm calculated between cities shows patterns of similar and dissimilar tree genera."
## [1] "The L1 norm sums the absolute differences between two vectors."
## [1] "The L2 norm (Euclidean distance) squares the differences before summing, so large differences in individual genera weigh more heavily."
## [1] "There are 278 tree genera selected from iNaturalist"
## [1] "There are 6349840 tree genera records sampled from iNaturalist."
## [1] "The top reported tree genera in iNaturalist"
##
## quercus acer pinus rubus solanum
## 394006 304046 288224 226971 189134
## euphorbia lonicera rhus cornus viburnum
## 175256 167183 153505 141628 132228
## prunus juniperus ilex veronica populus
## 119399 118538 116777 114632 112717
## rosa arctostaphylos sambucus ribes pieris
## 109679 93221 91521 89556 88215
## [1] "Abundance of tree genera in North America from iNaturalist"
## [1] "Do urban tree genera have a different relative abundance than those in natural settings?"
## [1] "Compare OpenTrees and iNat records in NYC"
## [1] "Sampled iNat records within 0.2 decimal degrees"
## [1] "There are 76583 tree genera records from iNat in NYC."
## [1] "There are 652169 tree genera records from OpenTrees in NYC."
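Sampling records within 0.2 decimal degrees, as described above, could be done with a simple bounding-box filter. This is only a sketch: the column names `lat` and `lon`, the toy records, and the approximate NYC reference point are all assumptions, and the original analysis may have used a radius or a spatial package instead.

```r
# Toy records; column names `lat`, `lon`, and `genus` are hypothetical
obs <- data.frame(
  genus = c("acer", "quercus", "pinus", "ailanthus"),
  lat   = c(40.75, 40.60, 42.00, 40.90),
  lon   = c(-73.98, -74.10, -74.00, -73.85)
)

# Approximate NYC centre (an assumed reference point, not from the source)
center <- c(lat = 40.71, lon = -74.01)
buffer <- 0.2  # decimal degrees, as in the text above

# Keep records whose latitude and longitude both fall within the buffer
in_box <- abs(obs$lat - center[["lat"]]) <= buffer &
          abs(obs$lon - center[["lon"]]) <= buffer
nearby <- obs[in_box, ]
print(nearby)
```

A box in decimal degrees is not square on the ground, since a degree of longitude is shorter than a degree of latitude at this latitude, so the buffer is only a rough spatial window.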
## [1] "Get counts of tree genera from iNat"
## [1] "Normalize counts of tree genera from iNat"
## # A tibble: 125 x 3
## genus count normalized_count
## <chr> <int> <dbl>
## 1 abies 4 0.0000522
## 2 abutilon 234 0.00306
## 3 acer 4894 0.0639
## 4 aesculus 835 0.0109
## 5 ailanthus 5088 0.0664
## 6 albizia 312 0.00407
## 7 alnus 61 0.000797
## 8 amelanchier 203 0.00265
## 9 aralia 848 0.0111
## 10 arctostaphylos 3 0.0000392
## # i 115 more rows
## [1] "Add additional genera as 0s"
## # A tibble: 309 x 3
## genus count normalized_count
## <chr> <dbl> <dbl>
## 1 abies 4 0.0000522
## 2 abutilon 234 0.00306
## 3 acacia 0 0
## 4 acca 0 0
## 5 acer 4894 0.0639
## 6 acrocomia 0 0
## 7 aesculus 835 0.0109
## 8 afrocarpus 0 0
## 9 agathis 0 0
## 10 agonis 0 0
## # i 299 more rows
## [1] "Get counts of tree genera from OpenTrees"
## [1] "Normalize counts of tree genera from OpenTrees"
## # A tibble: 68 x 3
## genus count normalized_count
## <chr> <int> <dbl>
## 1 acer 88739 0.136
## 2 aesculus 1287 0.00197
## 3 ailanthus 756 0.00116
## 4 albizia 163 0.000250
## 5 alnus 47 0.0000721
## 6 amelanchier 2032 0.00312
## 7 betula 1400 0.00215
## 8 carpinus 4042 0.00620
## 9 carya 99 0.000152
## 10 castanea 173 0.000265
## # i 58 more rows
## [1] "Add additional genera as 0s"
## # A tibble: 309 x 3
## genus count normalized_count
## <chr> <dbl> <dbl>
## 1 abies 0 0
## 2 abutilon 0 0
## 3 acacia 0 0
## 4 acca 0 0
## 5 acer 88739 0.136
## 6 acrocomia 0 0
## 7 aesculus 1287 0.00197
## 8 afrocarpus 0 0
## 9 agathis 0 0
## 10 agonis 0 0
## # i 299 more rows
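The "add additional genera as 0s" step puts both sources on the same genus list so that their vectors line up element by element. The sketch below shows one way to do this in base R with made-up values. It uses the union of the two sources as the full list, whereas the 309-row tibbles above suggest the analysis used a fixed reference list; the helper name `pad_zeros` is illustrative.

```r
# Toy normalized counts for two sources; values are made up
inat      <- c(abies = 0.1, acer = 0.6, ailanthus = 0.3)
opentrees <- c(acer = 0.7, betula = 0.3)

# Full genus list; here simply the union of both sources (an assumption)
all_genera <- sort(union(names(inat), names(opentrees)))

# Re-index a named vector on the full list, filling missing genera with 0
pad_zeros <- function(x, genera) {
  out <- setNames(numeric(length(genera)), genera)
  out[names(x)] <- x
  out
}

inat_full      <- pad_zeros(inat, all_genera)
opentrees_full <- pad_zeros(opentrees, all_genera)
print(rbind(inat = inat_full, opentrees = opentrees_full))
```

Padding does not change the proportions already present; it only guarantees that position `i` refers to the same genus in both vectors.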
## [1] "Calculate the L1 Norm (Distance) between Tree Genera from iNat and OpenTrees in NYC"
## [1] "The L1 Distance Between Tree Genera From iNat and OpenTrees in NYC is: 1.50539668794971"
## [1] "Which tree genera are contributing most to the difference in tree genera distributions (high L1 distance)?"
## [1] "Number of genera from iNaturalist not found in OpenTrees: 63"
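For two distributions that each sum to 1, the L1 distance ranges from 0 (identical) to 2 (no genera in common), so a value near 1.5 indicates a large difference. The sketch below, on made-up values, shows how the distance decomposes into per-genus contributions and how genera present in only one source can be listed.

```r
# Toy normalized distributions on a shared genus list; values are made up
inat      <- c(abies = 0.10, acer = 0.60, ailanthus = 0.30, betula = 0.00)
opentrees <- c(abies = 0.00, acer = 0.70, ailanthus = 0.00, betula = 0.30)

# Signed difference per genus (positive = more common in OpenTrees)
diff_by_genus <- opentrees - inat

# The L1 distance is the sum of the absolute per-genus differences
l1 <- sum(abs(diff_by_genus))

# Rank genera by how much each contributes to the L1 distance
contribution <- sort(abs(diff_by_genus), decreasing = TRUE)
print(contribution)

# Genera recorded in iNaturalist but absent from OpenTrees
only_inat <- names(inat)[inat > 0 & opentrees == 0]
print(only_inat)
```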
## [1] "Plot difference in normalized counts per genus and dataset (OpenTrees - iNat)"
## [1] "Right: Genera more common in OpenTrees"
## [1] "Left: Genera more common in iNaturalist"
## [1] "Plot difference in normalized counts per genus and dataset (OpenTrees - iNaturalist)"
## [1] "Right: Genera more common in OpenTrees"
## [1] "Left: Genera more common in iNaturalist"
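A difference plot of this kind can be sketched with base R's `barplot()`. The values below are the first six genera shared by the two NYC tibbles above, so the plot covers only a small subset; the colors and layout are illustrative choices, not the original figure.

```r
# Normalized counts for six genera, copied from the NYC tibbles above
opentrees <- c(acer = 0.136, aesculus = 0.00197, ailanthus = 0.00116,
               albizia = 0.000250, alnus = 0.0000721, amelanchier = 0.00312)
inat      <- c(acer = 0.0639, aesculus = 0.0109, ailanthus = 0.0664,
               albizia = 0.00407, alnus = 0.000797, amelanchier = 0.00265)

# Difference (OpenTrees - iNat), sorted so the extremes sit at the ends
delta <- sort(opentrees - inat)

# Horizontal bars: right of zero = more common in OpenTrees, left = iNaturalist
barplot(delta, horiz = TRUE, las = 1,
        col = ifelse(delta > 0, "forestgreen", "steelblue"),
        xlab = "Difference in normalized count (OpenTrees - iNat)")
abline(v = 0)
```

In this subset, acer is relatively more common in the OpenTrees street tree records, while ailanthus is far more common in iNaturalist observations.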
## [1] "There are 140 unique genera in the FIA dataset"
## [1] "There are 23935395 records in the FIA dataset"
## [1] "The top reported tree genera in FIA"
##
## pinus quercus acer abies populus liquidambar
## 5403040 3405811 2437490 1171758 1128641 889301
## picea fraxinus pseudotsuga nyssa betula carya
## 695528 657422 641185 629892 625576 595846
## juniperus ulmus tsuga thuja liriodendron prunus
## 580180 507803 485558 438573 403322 342478
## fagus cornus
## 283248 229597
## [1] "Abundance of tree genera in North America from FIA"
## [1] "Do urban tree genera have a different relative abundance than those in natural settings?"
## [1] "Compare OpenTrees and FIA records around NYC"
## [1] "Sampled FIA records within 0.5 decimal degrees"
## [1] "There are 7225 tree genera records from FIA around NYC."
## [1] "There are 652169 tree genera records from OpenTrees in NYC."
## [1] "Get counts of tree genera from FIA"
## [1] "Normalize counts of tree genera from FIA"
## # A tibble: 37 x 3
## Genus count normalized_count
## <chr> <int> <dbl>
## 1 acer 1875 0.260
## 2 ailanthus 123 0.0170
## 3 amelanchier 36 0.00498
## 4 betula 723 0.100
## 5 carpinus 16 0.00221
## 6 carya 256 0.0354
## 7 castanea 1 0.000138
## 8 celtis 8 0.00111
## 9 chamaecyparis 3 0.000415
## 10 cornus 28 0.00388
## # i 27 more rows
## [1] "Add additional genera as 0s"
## [1] "Get counts of tree genera from OpenTrees"
## [1] "Normalize counts of tree genera from OpenTrees"
## # A tibble: 68 x 3
## genus count normalized_count
## <chr> <int> <dbl>
## 1 acer 88739 0.136
## 2 aesculus 1287 0.00197
## 3 ailanthus 756 0.00116
## 4 albizia 163 0.000250
## 5 alnus 47 0.0000721
## 6 amelanchier 2032 0.00312
## 7 betula 1400 0.00215
## 8 carpinus 4042 0.00620
## 9 carya 99 0.000152
## 10 castanea 173 0.000265
## # i 58 more rows
## [1] "Add additional genera as 0s"
## [1] "Calculate the L1 Norm (Distance) between Tree Genera from FIA and OpenTrees in NYC"
## [1] "The L1 Distance Between Tree Genera From FIA and OpenTrees in NYC is: 1.2181468058455"
## [1] "Which tree genera are contributing most to the difference in tree genera distributions (high L1 distance)?"
## [1] "Number of genera from FIA not found in OpenTrees: 0"
## [1] "Plot difference in normalized counts per genus and dataset (OpenTrees - FIA)"
## [1] "Right: Genera more common in OpenTrees"
## [1] "Left: Genera more common in FIA"
## [1] "Plot difference in normalized counts per genus and dataset (OpenTrees - FIA)"
## [1] "Right: Genera more common in OpenTrees"
## [1] "Left: Genera more common in FIA"
## [1] "There are 135 unique genera in the FIA Urban dataset"
## [1] "There are 36979 records in the FIA Urban dataset"
## [1] "The top reported tree genera in Urban FIA"
##
## acer quercus juniperus ulmus fraxinus pinus celtis prunus
## 4718 4487 4210 2591 2071 1532 1514 964
## morus rhamnus carya populus triadica ilex thuja juglans
## 819 762 750 704 637 589 541 512
## prosopis tilia salix robinia
## 489 469 458 445
## [1] "Abundance of tree genera per state in Urban FIA"